DOMAIN: Smartphone, Electronics


CONTEXT : India is the second largest smartphone market globally after China. About 134 million smartphones were sold across India in 2017, and sales are estimated to increase to about 442 million in 2022. India ranked second in average time spent on the mobile web by smartphone users across Asia Pacific. The combination of very high sales volumes and average smartphone consumer behaviour has made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when they are buying a product vs. 15% who turn to social media. If a seller succeeds in showing smartphones matched to a user’s behaviour/choice at the right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system based on an individual consumer’s behaviour or choice.

DATA DESCRIPTION :
• author : name of the person who gave the rating
• country : country the person who gave the rating belongs to
• date : date of the rating
• domain: website from which the rating was taken
• extract: rating content
• language: language in which the rating was given
• product: name of the product/mobile phone for which the rating was given
• score: average rating for the phone
• score_max: highest rating given for the phone
• source: source from where the rating was taken

PROJECT OBJECTIVE : We will build a recommendation system using popularity-based and collaborative filtering methods to recommend to a user the mobile phones that are most popular and most personalised, respectively.

Steps and tasks:

1. Import the necessary libraries and read the provided CSVs as a data frame and perform the below steps.
 • Merge the provided CSVs into one data-frame.
 • Check a few observations and shape of the data-frame.
 • Round off scores to the nearest integers.
 • Check for missing values. Impute the missing values if there are any.
 • Check for duplicate values and remove them if there are any.
 • Keep only 1000000 data samples. Use random state=612.
 • Drop irrelevant features. Keep features like Author, Product, and Score.
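The merge, rounding, de-duplication and sampling steps above can be sketched as follows. This is a minimal sketch, assuming each CSV has already been read into a DataFrame with the columns from the data description; the helper name `merge_and_sample` and the toy frames are hypothetical stand-ins, not the notebook's own code.

```python
import pandas as pd

def merge_and_sample(frames, n_samples=1_000_000, random_state=612):
    """Concatenate rating frames, round scores, de-duplicate and sample."""
    df = pd.concat(frames, ignore_index=True)
    # Round to the nearest integer; NaN scores stay NaN for later imputation.
    df["score"] = df["score"].round()
    df = df.drop_duplicates()
    # Sample only if more rows remain than requested.
    if len(df) > n_samples:
        df = df.sample(n=n_samples, random_state=random_state)
    return df.reset_index(drop=True)

# Toy stand-ins for the provided CSVs (in practice: one pd.read_csv per file).
part_a = pd.DataFrame({"author": ["u1", "u2", "u3"],
                       "product": ["p1", "p2", "p3"],
                       "score": [8.4, 9.6, 7.2]})
part_b = pd.DataFrame({"author": ["u1", "u4"],
                       "product": ["p1", "p4"],
                       "score": [8.4, 10.0]})
merged = merge_and_sample([part_a, part_b])
print(merged.shape)
print(merged[["author", "product", "score"]].head())
```

Rounding before de-duplication means that rows differing only in the decimal part of the score collapse into one; swap the two steps if that is not wanted.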

  • As we can see from the data, the date format is not the same for all records, so we will have to handle that.

  • Some are in dd/mm/yyyy and some are mm/dd/yyyy format.
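One way to handle the two formats is to try one and fall back to the other. The sketch below assumes dates are strings and gives dd/mm/yyyy priority; dates that are valid in both formats (for example 03/04/2017) cannot be resolved from the string alone, so they are read as day-first here. The function name `parse_mixed_dates` is hypothetical.

```python
import pandas as pd

def parse_mixed_dates(dates: pd.Series) -> pd.Series:
    """Parse dd/mm/yyyy first, then fall back to mm/dd/yyyy."""
    # Values that do not fit a format become NaT instead of raising.
    day_first = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")
    month_first = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")
    return day_first.fillna(month_first)

raw = pd.Series(["25/12/2016", "12/25/2016", "03/04/2017"])
parsed = parse_mixed_dates(raw)
print(parsed)
```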

  • Thus these three features have approx. 4.5% missing values; 'score' and 'score_max' have exactly the same number of missing values.

    Since this is almost 10% of the data, we will not drop the null values in the score and score_max columns; we will impute them with the median values and drop the nulls from the other columns.
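The imputation described above can be sketched as follows, assuming a frame with 'score' and 'score_max' columns. The helper name `impute_scores` and the toy data are hypothetical.

```python
import pandas as pd

def impute_scores(df):
    """Median-impute score columns, then drop rows with other nulls."""
    # Share of missing values per column, in percent.
    missing_pct = df.isna().mean() * 100
    out = df.copy()
    for col in ["score", "score_max"]:
        out[col] = out[col].fillna(out[col].median())
    # Remaining nulls are in non-score columns; drop those rows.
    out = out.dropna().reset_index(drop=True)
    return out, missing_pct

reviews = pd.DataFrame({
    "author": ["u1", "u2", None, "u4"],
    "product": ["p1", "p2", "p3", "p4"],
    "score": [8.0, None, 10.0, 6.0],
    "score_max": [10.0, None, 10.0, 10.0],
})
cleaned, missing_pct = impute_scores(reviews)
print(missing_pct)
print(cleaned)
```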

    Thus, multiple similar names with different details exist in the product list. For example, the following two are exactly the same model:

  • Huawei P8lite zwart / 16 GB and

  • Huawei P8 Lite Smartphone, Display 5" IPS, Processore Octa-Core 1.5 GHz, Memoria Interna da 16 GB, 2 GB RAM, Fotocamera 13 MP, monoSIM, Android 5.0, Bianco [Italia]

    Another observation is that the 'phone_url' column also contains the phone name and model information. Let's check what extra information is present in the 'product' column.

  • Extra information is generally:

  • phone memory: 8GB/16GB/32GB etc

  • phone colour: Marble white, Blue, Red etc

  • carrier: AT&T, Verizon etc.


    Another observation is that these specifications are not present in all the product names. For example, there is no way to differentiate between the two products below: 'Samsung Galaxy S III Cellular Phone' and 'Samsung Galaxy S III SPH-L710 - 16GB - Marble White (Sprint) Smartphone'.


  • Thus the differentiating information is not the same in all the product details. Also, the goal is to recommend a phone, not a carrier, and other specs such as colour are of low importance for recommendation. The only consistent differentiating information across all product names is the phone manufacturer and model number, which can also be extracted from the 'phone_url' column. Let's check other phone names as well.

  • As can be seen, the same pattern is visible for the most common types of phones. Thus it is better to use the phone name and model number rather than the other details mentioned in the 'product' column.
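Extracting the phone name from 'phone_url' could look like the sketch below. The URL shape used here (a path whose last segment is a hyphenated manufacturer and model, such as `/cellphones/samsung-galaxy-s-iii/`) is an assumption for illustration, and `phone_name_from_url` is a hypothetical helper; adjust the parsing to the actual values in the column.

```python
def phone_name_from_url(url: str) -> str:
    """Return the last path segment of a URL as a lower-case phone name."""
    # ASSUMPTION: the last non-empty path segment holds "manufacturer-model".
    segments = [part for part in url.strip().split("/") if part]
    if not segments:
        return ""
    return segments[-1].replace("-", " ").lower()

example = phone_name_from_url("/cellphones/samsung-galaxy-s-iii/")
print(example)
```

On a DataFrame this would be applied with something like `df["phone_url"].map(phone_name_from_url)`.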

    Following observations are made:

  • The most active user is 'Amazon customer'.

  • 'Anonymous' and 'unknown' users are those whose names are not known, so we can use the same label to impute blank values in the 'author' column.

  • Many names are the same but in different languages, like 'Amazon customer' and 'Cliente Amazon'. Let's search for these first and clean up the differences due to language. Names like 'einer Kundin', 'einem Kunden', 'Anonymous' and 'unknown' can all be interpreted the same way, i.e. as an 'unknown customer'. Let's replace these names too.
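The author clean-up described above can be sketched with a small alias table. The table only covers the names mentioned in the observations; the helper name `normalise_author` and the sample names are hypothetical, and a real run would need more aliases.

```python
import pandas as pd

# Lower-cased aliases mapped to one canonical label.
AUTHOR_ALIASES = {
    "amazon customer": "Amazon customer",
    "cliente amazon": "Amazon customer",
    "einer kundin": "unknown customer",
    "einem kunden": "unknown customer",
    "anonymous": "unknown customer",
    "unknown": "unknown customer",
}

def normalise_author(name):
    # Blank or missing names are treated as unknown customers.
    if pd.isna(name) or not str(name).strip():
        return "unknown customer"
    cleaned = str(name).strip()
    return AUTHOR_ALIASES.get(cleaned.lower(), cleaned)

authors = pd.Series(["Cliente Amazon", "einer Kundin", None, "Priya", " Anonymous "])
cleaned_authors = authors.map(normalise_author)
print(cleaned_authors.tolist())
```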

    1. Answer the following questions
      • Identify the most rated features.
      • Identify the users with most number of reviews.
      • Select the data with products having more than 50 ratings and users who have given more than 50 ratings. Report the shape of the final dataset.
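These three questions map onto `value_counts` and a boolean filter. The sketch below assumes a frame with 'author', 'product' and 'score' columns; `filter_active` is a hypothetical helper, and the toy call uses a threshold of 1 instead of 50 so that the effect is visible on seven rows.

```python
import pandas as pd

def filter_active(df, min_ratings=50):
    """Keep rows whose product and author both have more than min_ratings ratings."""
    product_counts = df["product"].value_counts()
    author_counts = df["author"].value_counts()
    keep_products = product_counts[product_counts > min_ratings].index
    keep_authors = author_counts[author_counts > min_ratings].index
    mask = df["product"].isin(keep_products) & df["author"].isin(keep_authors)
    return df[mask].reset_index(drop=True)

toy = pd.DataFrame({
    "author": ["u1", "u1", "u2", "u2", "u3", "u3", "u4"],
    "product": ["p1", "p2", "p1", "p2", "p1", "p3", "p1"],
    "score": [10, 8, 9, 8, 10, 6, 7],
})
most_rated = toy["product"].value_counts().idxmax()   # most rated product
top_users = toy["author"].value_counts().head(3)      # most active users
final = filter_active(toy, min_ratings=1)
print(most_rated, final.shape)
```

Both counts are taken on the unfiltered data in a single pass, so a product can end up with fewer ratings than the threshold once inactive users are removed.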
    1. Build a popularity based model and recommend top 5 mobile phones.
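A popularity-based model needs only per-product aggregates. The sketch below ranks products by number of ratings and breaks ties by mean score; this ranking rule is an assumption, and other definitions of popularity (for example the count of top-score ratings) are equally possible. `top_popular` is a hypothetical helper.

```python
import pandas as pd

def top_popular(df, n=5):
    """Rank products by rating count, then by mean score."""
    stats = df.groupby("product")["score"].agg(ratings="count", mean_score="mean")
    return stats.sort_values(["ratings", "mean_score"], ascending=False).head(n)

toy_ratings = pd.DataFrame({
    "product": ["a", "a", "a", "b", "b", "c"],
    "score": [10, 8, 9, 10, 10, 7],
})
top = top_popular(toy_ratings, n=2)
print(top)
```

The same list is shown to every user, which is why no author column is needed here.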
    1. Build a collaborative filtering model using SVD. You can use SVD from surprise or build it from scratch (note: in case you’re building it from scratch, you can limit your data points to 5000 samples if you face memory issues). Build a collaborative filtering model using kNNWithMeans from surprise. You can try both user-based and item-based models.

    Collaborative filtering model using SVD
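For the "from scratch" route, a minimal sketch is a truncated SVD of the user-mean-centred rating matrix, written here with NumPy. Note that this is plain linear-algebra SVD with unrated cells filled by the user mean; it is not the SGD-trained factorisation that the `SVD` class in surprise implements, so its scores will differ. The function name `svd_predict` and the toy ratings are hypothetical.

```python
import numpy as np
import pandas as pd

def svd_predict(ratings, k=2):
    """Return a user x product frame of scores predicted from a rank-k SVD."""
    matrix = ratings.pivot_table(index="author", columns="product", values="score")
    user_means = matrix.mean(axis=1)
    # Centre each user's ratings; unrated cells become 0, i.e. the user mean.
    centred = matrix.sub(user_means, axis=0).fillna(0.0)
    u, s, vt = np.linalg.svd(centred.to_numpy(), full_matrices=False)
    k = min(k, len(s))
    # Keep the k largest singular values and rebuild the matrix.
    approx = (u[:, :k] * s[:k]) @ vt[:k, :]
    return pd.DataFrame(approx + user_means.to_numpy()[:, None],
                        index=matrix.index, columns=matrix.columns)

toy = pd.DataFrame({
    "author": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "product": ["p1", "p2", "p1", "p3", "p2", "p3"],
    "score": [10, 8, 9, 6, 7, 5],
})
low_rank = svd_predict(toy, k=2)   # smoothed predictions
full_rank = svd_predict(toy, k=3)  # reproduces the centred matrix exactly
print(low_rank.round(2))
```

A smaller `k` smooths the predictions more strongly; at full rank the observed ratings are reproduced and unrated cells fall back to the user's mean.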

    Collaborative filtering model using kNNWithMeans (item-based)

    Collaborative filtering model using kNNWithMeans (user-based)

    1. Evaluate the collaborative model. Print RMSE value.
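RMSE is the square root of the mean squared difference between actual and predicted scores. A minimal sketch of the metric, independent of any particular model (the function name `rmse` and the sample values are illustrative only):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Errors are 1, 0 and -2, so the value is sqrt((1 + 0 + 4) / 3).
value = rmse([10, 8, 6], [9, 8, 8])
print(round(value, 4))
```

Lower is better, and the value is on the same scale as the ratings themselves.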

    Best RMSE score is given by SVD, so let's use it for further analysis.

    1. Predict score (average rating) for test users.
    1. Report your findings and inferences.
      Most popular phone (rated 10 by highest number of people):
      • Overall: verykool t742
      • Amongst top users: samsung Galaxy Note 5

        The overall data is highly skewed towards 'Amazon customers' from different countries. This may be because Amazon is the biggest trader of phones in the world, although the actual user names from Amazon should ideally have been used.
        Most of the authors have given a rating of '10' or '8'.
        Both knn_i (item-based) and knn_u (user-based) have roughly similar RMSE.
    1. Try and recommend top 5 products for test users.
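Given predicted scores for every user and product, the top 5 for a user are the highest-scoring products the user has not already rated. The sketch below assumes the predictions are available as a user x product DataFrame (from whichever model was chosen); `recommend_top_n` and the toy values are hypothetical.

```python
import pandas as pd

def recommend_top_n(predicted, rated, user, n=5):
    """Return the n best-scoring products the user has not rated yet."""
    seen = set(rated.loc[rated["author"] == user, "product"])
    # Drop already-rated products; ignore any that are missing from the columns.
    candidates = predicted.loc[user].drop(labels=list(seen), errors="ignore")
    return candidates.sort_values(ascending=False).head(n).index.tolist()

predicted = pd.DataFrame([[9.5, 7.0, 8.2, 6.1]],
                         index=["u1"], columns=["p1", "p2", "p3", "p4"])
rated = pd.DataFrame({"author": ["u1"], "product": ["p1"], "score": [10]})
picks = recommend_top_n(predicted, rated, "u1", n=2)
print(picks)
```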
    1. Try cross validation techniques to get better results.
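The mechanics of k-fold cross-validation can be sketched without any recommender library. In the sketch below the model is only a stand-in baseline (product mean, falling back to the global mean) used to keep the example short; it is not the SVD or kNN model, and `kfold_rmse` is a hypothetical helper. The same loop applies with any model that is fitted on the training fold and scored on the held-out fold.

```python
import numpy as np
import pandas as pd

def kfold_rmse(ratings, n_splits=5, random_state=612):
    """Per-fold RMSE of a product-mean baseline under k-fold cross-validation."""
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(ratings))
    scores = []
    for test_idx in np.array_split(order, n_splits):
        mask = np.zeros(len(ratings), dtype=bool)
        mask[test_idx] = True
        train, test = ratings[~mask], ratings[mask]
        # Stand-in model: product mean, with the global mean for unseen products.
        global_mean = train["score"].mean()
        product_means = train.groupby("product")["score"].mean()
        predicted = test["product"].map(product_means).fillna(global_mean)
        scores.append(float(np.sqrt(((test["score"] - predicted) ** 2).mean())))
    return scores

toy = pd.DataFrame({
    "author": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "product": ["p1", "p2", "p1", "p3", "p2", "p3"],
    "score": [10, 8, 9, 6, 7, 5],
})
fold_scores = kfold_rmse(toy, n_splits=3)
print(fold_scores, sum(fold_scores) / len(fold_scores))
```

Reporting the mean and spread across folds gives a steadier comparison between models than a single train/test split.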

    Thus, on the cross-validation scores too, SVD gives better performance.

    1. In what business scenario should you use popularity-based recommendation systems?
      Popularity-based recommendation systems can be useful in multiple scenarios, such as:
      When there is no data about the user and items.
      When it is required to show the most popular items in different categories along with personalised results, for example:
      Most popular Punjabi songs or most popular English songs on a music website/app
      Most popular trends in western wear or traditional wear
      Most popular holiday packages for honeymoon trips, bike trips, Himalayan trips, etc.
    1. In what business scenario should you use CF-based recommendation systems?
      Giving personalised recommendations to a user when user history or item data is available. Some examples: personalised movie recommendations on streaming sites like Netflix, Amazon Prime, YouTube, etc.
    1. What other possible methods can you think of which can further improve the recommendations for different users?
      Other than popularity-based and collaborative filtering methods, hybrid recommendation methods such as content + collaborative, as well as demographic, utility-based and knowledge-based recommendation systems, can also be used.